Automatically Building a Corpus for a Minority Language from the Web

Authors

  • Rosie Jones
  • Rayid Ghani
Abstract

We present an approach to language-specific query-based sampling which, given a single document in a target language, can find many more examples of documents in that language by automatically constructing queries to access such documents on the World Wide Web. We propose a number of methods for building search queries to quickly obtain documents in the target language. They perform accurately and efficiently when building a corpus of documents in Tagalog starting from a single seed document, even though these documents make up only 2.5% of the documents in the collection. We found that sampling with a query consisting of a word selected according to its probability in the minority language corpus constructed so far was very successful. This method built a corpus of documents with word frequencies similar to those in the corpus based on all Tagalog documents in our collection, and required a relatively small number of search queries. It also quickly acquired good coverage of vocabulary terms. However, adding an element of randomness to the query may give greater coverage, although more queries are required.
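The probability-based query selection described in the abstract can be illustrated with a minimal sketch. The abstract does not specify tokenization, the search interface, or how documents are filtered by language, so everything below is an assumption: the function names `term_frequencies` and `sample_query_term` are hypothetical, tokens are split on whitespace, and the seed text is a made-up placeholder rather than data from the paper.

```python
import random
from collections import Counter


def term_frequencies(documents):
    """Count lowercased, whitespace-delimited tokens across the corpus built so far.

    Assumption: simple whitespace tokenization; the paper's abstract does not
    say how terms are extracted.
    """
    counts = Counter()
    for text in documents:
        counts.update(text.lower().split())
    return counts


def sample_query_term(counts, rng=random):
    """Draw one term with probability proportional to its frequency in the corpus."""
    terms = list(counts)
    weights = [counts[term] for term in terms]
    return rng.choices(terms, weights=weights, k=1)[0]


# Hypothetical seed text standing in for the single target-language document.
seed_corpus = ["ang bata ay kumain ng mangga", "ang aso ay tumakbo"]
counts = term_frequencies(seed_corpus)

# A seeded generator keeps the illustration reproducible.
query = sample_query_term(counts, random.Random(0))
```

In a full loop, the sampled term would be sent to a search engine, the returned documents judged to be in the target language would be added to the corpus, and the frequencies recomputed before the next query. Those steps are left out here because the abstract gives no detail on them.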

Similar resources

AUTOLEX: An Automatic Lexicon Builder for Minority Languages Using an Open Corpus

The aim of this study is to build natural language resources for minority languages and other languages with limited resources. Manually building these resources is tedious and costly. These resources, such as language corpora and lexicons, will be used for natural language processing research and system development. Tagalog, a minority language, was considered in this study as a test b...

Building Minority Language Corpora by Learning to Generate

The Web is an obvious source of valuable information, but the process of collecting, organizing and utilizing these resources is difficult. We describe CorpusBuilder, an approach for automatically generating Web-search queries for collecting documents matching a minority concept. We use the concept of text documents belonging to a minority natural language on the Web. Individual documents are auto...

Designing and Implementing a 3D Indoor Navigation Web Application

In recent years, the need has arisen for indoor navigation systems that can guide a client during natural hazards and fires, because human settlements have become increasingly complex. This research paper aims to design and implement a visual indoor navigation web application. The designed system processes the CityGML data model automatically and then extracts semantic, topologic and geometric...

Techniques of Ontology and its Usage in Indian Languages - A Review

Ontology is presently an emerging research topic in fields such as artificial intelligence, the semantic web, natural language processing, software engineering, and information architecture. Manual ontology building is a time-consuming and tedious task. Over the last few decades, different ontology building approaches have been used to build ontologies either semi-automatically or a...

Reinforcement-Based Web Crawler

This paper presents a focused web crawler system that automatically creates a minority language corpus. The system uses a database of relevant and irrelevant documents to test the relevance of retrieved web documents. It requires a starting web document to indicate where the search should begin.

Publication date: 2000